Melody: a note detection system

This is a program that tries to recognize the notes in a melody and produce a visualization of them. Here is an example of what the result looks like with a short clip from a Beethoven sonata (the audio can be found below in the "Note onset detection" section of this notebook):

beethoven demo

There is a section at the bottom of the notebook with more examples. "Run all cells" may be helpful here since the examples require all the previous code cells to be executed first.

The general strategy

We're using a Fourier Transform-based strategy: we break the audio into small segments and look at the FT of each segment to determine which note is being played in it. Currently, our approach doesn't generalize well to polyphonic music and is limited to pieces with one note being played at a time.
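As a rough sketch of this segment-and-transform idea (the function and variable names here are illustrative, not the notebook's):

```python
import numpy as np

def segment_fts(audio, segment_len):
    """Split audio into fixed-size segments and return the FT magnitude
    of each segment (an illustrative sketch of the overall strategy)."""
    n_segments = len(audio) // segment_len
    fts = []
    for i in range(n_segments):
        segment = audio[i * segment_len : (i + 1) * segment_len]
        # rfft gives the spectrum of a real-valued signal; we keep only
        # the magnitudes, since phase isn't needed for pitch detection.
        fts.append(np.abs(np.fft.rfft(segment)))
    return fts
```

The peak bin of each segment's magnitude spectrum already hints at the pitch; the rest of the notebook refines this with reference FTs rather than reading frequencies off directly.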

Before running on actual music, we "train" on a chromatic scale and save the FTs of each note in the scale. We'll use these FTs as reference points when trying to detect pitch. Going through this "training" instead of using a fixed frequency-to-note mapping can help to calibrate to the specific timbre and tuning of different instruments.

I. Note onset detection

We want to find the timestamps of the start of each note (this is only used for the training phase, when we break the chromatic scale down into individual notes).

As an example, there are nine notes in this clip (Beethoven Piano Sonata No. 31, mvt. 3, fugue theme), and we want to mark the onset of each of these.

Our approach involves picking out peaks in the raw audio waveform. Since each note produces a new burst of sound, its onset tends to correlate with a spike in the audio. We're interested in these macro-level peaks:

But there are many other peaks (on the "micro"-level) caused by the sinusoidal nature of sound. We can see these micro-level peaks if we zoom in on a small part of the audio:

We want to work with a "zoomed out" version of the audio so that we only see the macro-level peaks and ignore the micro-level ones. To do this, we can slide a window over the waveform, take the maximum amplitude of the audio within that window, and discard the rest of the data. This achieves "zooming out" in a sense by hiding the micro-level movements in the waveform and keeping only the general, macro trend.

The condense() function performs this step, returning the condensed data (orange line in the graph below) as well as the timestamps in the original audio where each window's maximum occurs (green dots in the graph below), which will be used later.
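A minimal sketch of what such a condense() step could look like (the real implementation's signature and details may differ):

```python
import numpy as np

def condense(audio, window):
    """Slide a non-overlapping window over the waveform and keep only the
    maximum amplitude in each window, along with its position in the
    original audio (a sketch of the notebook's condense() step)."""
    maxima, positions = [], []
    for start in range(0, len(audio) - window + 1, window):
        chunk = np.abs(audio[start : start + window])
        i = int(np.argmax(chunk))
        maxima.append(chunk[i])
        positions.append(start + i)  # index back into the original audio
    return np.array(maxima), np.array(positions)
```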

Finally, we can pick out the peaks from the orange line (keeping only ones that are above a certain threshold) and, if necessary, merge adjacent peaks that are too close together.
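This thresholding-and-merging logic could be sketched as follows (the parameter names and the tie-breaking rule for nearby peaks are assumptions, not necessarily the notebook's):

```python
def get_peaks(values, threshold, min_gap):
    """Pick local maxima at or above threshold, merging any two peaks
    closer than min_gap by keeping the taller one (a sketch of the role
    played by get_peaks() in the notebook)."""
    peaks = []
    for i in range(1, len(values) - 1):
        is_peak = values[i] >= threshold and values[i] >= values[i - 1] and values[i] > values[i + 1]
        if not is_peak:
            continue
        if peaks and i - peaks[-1] < min_gap:
            # two peaks too close together: keep only the taller one
            if values[i] > values[peaks[-1]]:
                peaks[-1] = i
        else:
            peaks.append(i)
    return peaks
```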

Final result for onset detection:

II. Finding reference FTs

We take the chromatic scale and find the Fourier Transform of each note in it, storing the results in the list notes.

The Fourier Transforms are cut off after a certain frequency above which there shouldn't be much activity for musical notes in a reasonable range.
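A sketch of how these truncated reference FTs might be computed (the function name and the cutoff value here are assumed for illustration):

```python
import numpy as np

def reference_fts(note_clips, sample_rate, cutoff_hz=4000):
    """Compute a truncated FT magnitude for each note clip. Bins above
    cutoff_hz are dropped, since musical notes in a reasonable range
    shouldn't have much activity up there (cutoff value is an assumption)."""
    refs = []
    for clip in note_clips:
        ft = np.abs(np.fft.rfft(clip))
        freqs = np.fft.rfftfreq(len(clip), d=1.0 / sample_rate)
        refs.append(ft[freqs <= cutoff_hz])
    return refs
```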

The resulting FT for the first note in the scale (C3):

We're mostly interested in the peaks in the above graph (representing the fundamental frequency of 131 Hz and its overtones), so we want to produce a "sharpened" version of this FT containing only these peaks.

We can reuse the peak-picking function get_peaks() used in onset detection:
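For illustration, here is a self-contained sketch of the sharpening idea, inlining the local-maximum check rather than calling get_peaks() (the threshold ratio is an assumed parameter):

```python
import numpy as np

def sharpen(ft, threshold_ratio=0.1):
    """Zero out everything in the FT except local peaks that rise above a
    fraction of the tallest bin, keeping only the fundamental and its
    overtones (a sketch of the sharpening step)."""
    threshold = threshold_ratio * ft.max()
    sharp = np.zeros_like(ft)
    for i in range(1, len(ft) - 1):
        if ft[i] >= threshold and ft[i] >= ft[i - 1] and ft[i] > ft[i + 1]:
            sharp[i] = ft[i]
    return sharp
```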

Result after sharpening (of the same note C3):

III. Putting everything to use: note detection

For the given audio clip, we can slide a small window over it and find the FT of each window. We then compare this FT with the ref_fts to find the most likely note. We're treating the FT of this window as a vector $\mathbf{v}$ and finding the reference FT $\mathbf{r}$ that matches its direction the most (maximizing $\frac{\mathbf{v} \cdot \mathbf{r}}{\| \mathbf{r} \|}$).
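The scoring step can be sketched directly from the formula above (best_match is an illustrative name, not the notebook's):

```python
import numpy as np

def best_match(v, refs):
    """Score each reference FT r by v . r / ||r|| and return the index of
    the best-scoring reference. Dividing by ||v|| as well wouldn't change
    the argmax, since v is the same across all references."""
    scores = [np.dot(v, r) / np.linalg.norm(r) for r in refs]
    return int(np.argmax(scores))
```

Normalizing by the reference's norm matters because the recorded notes can differ in loudness; without it, a louder reference note would win ties unfairly.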

The Fourier Transform is a linear transformation that expresses a signal in the basis of sinusoids of different frequencies; this note detection step is also based on a linear change of basis, which switches from a basis of frequencies to a basis of musical notes. We previously tried to combine these two transformations, skipping the Fourier step and directly projecting the signal onto vectors representing the raw waveforms of each note. However, the results were not nearly as promising. Part of this difference is likely due to the FT sharpening step we perform between the two transformations.

Notes that are an octave apart are harder to distinguish, so we add an "octave correction" step at the end of our detection function. This step finds the most prominent frequency in the FT (including its overtones) and compares it to the frequency in the reference FT to try to determine the right octave (each octave up doubles the frequency).
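The arithmetic behind this correction is small enough to sketch (octave_shift is an illustrative name): since each octave up doubles the frequency, the octave offset between the detected pitch and the matched reference is the nearest integer number of doublings.

```python
import math

def octave_shift(detected_hz, reference_hz):
    """Return the whole number of octaves separating the detected
    frequency from the reference frequency (positive = detected is
    higher). Each octave is one doubling, hence the log base 2."""
    return round(math.log2(detected_hz / reference_hz))
```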

The find_strongest_freq() function estimates the frequency of the note in a given FT. For each peak in the FT, the function estimates its "strength" by adding up the strengths of all integer multiples of that frequency (with more weight given to the first few harmonics). It's roughly accurate but not quite enough to be used directly for note detection, so we only use it for octave correction.
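A sketch of the harmonic-summation idea described above (the 1/k weighting and the candidate-bin interface are assumptions, not necessarily the notebook's exact scheme):

```python
import numpy as np

def find_strongest_freq(ft, peak_bins, n_harmonics=5):
    """Score each candidate fundamental bin by summing the FT magnitude
    at its integer multiples, weighting the lower harmonics more heavily,
    and return the best-scoring bin (a sketch of the harmonic summation)."""
    best_bin, best_score = None, -1.0
    for b in peak_bins:
        score = 0.0
        for k in range(1, n_harmonics + 1):
            if b * k < len(ft):
                score += ft[b * k] / k  # 1/k weighting favors early harmonics
        if score > best_score:
            best_bin, best_score = b, score
    return best_bin
```

The weighting is what keeps an overtone from outscoring the true fundamental: the overtone's own harmonics are sparser in the FT, so its weighted sum comes out lower.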

The final function for running note detection on a file and producing the visualization:

IV. Examples

Alan Menken, "The Bells of Notre Dame" from The Hunchback of Notre Dame soundtrack:

Jonathan Coulton, "Still Alive" from the Portal soundtrack:

A chromatic scale (one that's faster than the one used for "training") to test a range of notes:

Beethoven, Piano Sonata No. 32, mvt. 2 (here, the first six notes are new notes for which we don't have reference FTs; they are in a higher octave that was not included in the training scale, and the octave correction step helped identify them correctly):

Beethoven, Symphony No. 9, mvt. 4, "Ode to Joy" theme:

William Bolcom, "Graceful Ghost Rag":